Computer Science - Distributed, Parallel, and Cluster Computing

Papers

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, et al. • (2023) • DOI: 10.48550/arXiv.2309.06180

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for eac...